Exploring signature multiplicity in microarray data using ensembles of randomized trees
A challenging and novel direction for feature selection research in computational biology is the analysis of signature multiplicity. In this work, we investigate the effect of signature multiplicity on feature importance scores derived from tree-based ensemble methods. We show that looking at individual tree rankings in an ensemble can highlight the existence of multiple signatures, and we propose a simple post-processing method based on clustering that can return smaller signatures with better predictive performance than signatures derived from the global tree ranking, at almost no additional cost.
DMFSGD: A Decentralized Matrix Factorization Algorithm for Network Distance Prediction
The knowledge of end-to-end network distances is essential to many Internet
applications. As active probing of all pairwise distances is infeasible in
large-scale networks, a natural idea is to measure a few pairs and to predict
the other ones without actually measuring them. This paper formulates the
distance prediction problem as matrix completion where unknown entries of an
incomplete matrix of pairwise distances are to be predicted. The problem is
solvable because strong correlations among network distances exist and cause
the constructed distance matrix to be low rank. The new formulation circumvents
the well-known drawbacks of existing approaches based on Euclidean embedding.
A new algorithm, called Decentralized Matrix Factorization by Stochastic
Gradient Descent (DMFSGD), is proposed to solve the network distance prediction
problem. By letting network nodes exchange messages with each other, the
algorithm is fully decentralized and only requires each node to collect and to
process local measurements, with neither explicit matrix constructions nor
special nodes such as landmarks and central servers. In addition, we
comprehensively compared matrix factorization and Euclidean embedding to
demonstrate the suitability of the former for network distance prediction. We
further studied
the incorporation of a robust loss function and of non-negativity constraints.
Extensive experiments on various publicly-available datasets of network delays
show not only the scalability and the accuracy of our approach but also its
usability in real Internet applications. Comment: submitted to IEEE/ACM Transactions on Networking on Nov. 201
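The matrix-completion idea in this abstract (fit a low-rank factorization to the measured pairs by stochastic gradient descent, then predict the unmeasured ones) can be sketched as below. This is only a minimal, centralized simulation under stated assumptions, not the paper's decentralized message-exchange protocol: the function names, the squared loss, the L2 regularization and the values of `rank`, `lr` and `reg` are all hypothetical choices made for illustration.

```python
import random


def factorize_distances(measurements, n_nodes, rank=3, lr=0.05, reg=0.01,
                        epochs=200, seed=0):
    """Fit D[i][j] ~ dot(U[i], V[j]) from (i, j, distance) triples by SGD.

    Assumption: squared loss with L2 regularization; the paper also studies
    a robust loss and non-negativity constraints, which are not shown here.
    """
    rng = random.Random(seed)
    # Each node i owns one "outgoing" row U[i] and one "incoming" row V[i].
    U = [[rng.uniform(0, 1) for _ in range(rank)] for _ in range(n_nodes)]
    V = [[rng.uniform(0, 1) for _ in range(rank)] for _ in range(n_nodes)]
    samples = list(measurements)
    for _ in range(epochs):
        rng.shuffle(samples)
        for i, j, dist in samples:
            err = dist - sum(u * v for u, v in zip(U[i], V[j]))
            for k in range(rank):
                u_ik, v_jk = U[i][k], V[j][k]
                # Gradient step on one measured pair only (local information).
                U[i][k] += lr * (err * v_jk - reg * u_ik)
                V[j][k] += lr * (err * u_ik - reg * v_jk)
    return U, V


def predict_distance(U, V, i, j):
    """Predicted distance from node i to node j."""
    return sum(u * v for u, v in zip(U[i], V[j]))


def mean_squared_error(U, V, samples):
    """Average squared prediction error over (i, j, distance) triples."""
    return sum((d - predict_distance(U, V, i, j)) ** 2
               for i, j, d in samples) / len(samples)
```

Each update touches only the factors of the two nodes involved in one measurement, which is the property that makes a decentralized variant plausible; how nodes actually exchange those factors is left to the paper.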
Automated multimodal volume registration based on supervised 3D anatomical landmark detection
We propose a new method for automatic 3D multimodal registration based on anatomical landmark detection. Landmark detectors are learned independently in the two imaging modalities using Extremely Randomized Trees and multi-resolution voxel windows. A least-squares fitting algorithm is then used for rigid registration based on the landmark positions as predicted by these detectors in the two imaging modalities. Experiments are carried out with this method on a dataset of pelvis CT and CBCT scans related to 45 patients. On this dataset, our fully automatic approach yields results that are very competitive with a manually assisted state-of-the-art rigid registration algorithm.
Classifying pairs with trees for supervised biological network inference
Networks are ubiquitous in biology and computational approaches have been
largely investigated for their inference. In particular, supervised machine
learning methods can be used to complete a partially known network by
integrating various measurements. Two main supervised frameworks have been
proposed: the local approach, which trains a separate model for each network
node, and the global approach, which trains a single model over pairs of nodes.
Here, we systematically investigate, theoretically and empirically, the
exploitation of tree-based ensemble methods in the context of these two
approaches for biological network inference. We first formalize the problem of
network inference as classification of pairs, unifying in the process
homogeneous and bipartite graphs and discussing two main sampling schemes. We
then present the global and the local approaches, extending the latter for the
prediction of interactions between two unseen network nodes, and discuss their
specializations to tree-based ensemble methods, highlighting their
interpretability and drawing links with clustering techniques. Extensive
computational experiments on various biological networks clearly show that
these methods are competitive with existing methods. Comment: 22 page
Optimal model parameters for multi-objective large-eddy simulations
A methodology is proposed for the assessment of error dynamics in large-eddy simulations. It is demonstrated that the optimization of model parameters with respect to one flow property can come at the expense of the accuracy with which other flow properties are predicted. Therefore, an approach is introduced which allows the total errors to be assessed based on various flow properties simultaneously. We show that parameter settings exist for which all monitored errors are "near optimal," and refer to such regions as "multi-objective optimal parameter regions." We focus on multi-objective errors that are obtained from weighted spectra, emphasizing both large- as well as small-scale errors. These multi-objective optimal parameter regions depend strongly on the simulation Reynolds number and the resolution. At too coarse resolutions, no multi-objective optimal regions might exist, as not all error components might simultaneously be sufficiently small. The identification of multi-objective optimal parameter regions can be adopted to effectively compare different subgrid models. This is illustrated by a comparison between large-eddy simulations using the Lilly-Smagorinsky model, the dynamic Smagorinsky model and a new Re-consistent eddy-viscosity model. Based on the new methodology for error assessment, the latter model is found to be the most accurate and robust among the selected subgrid models, in combination with the finite volume discretization used in the present study.
Context-dependent feature analysis with random forests
In many cases, feature selection is more complicated than identifying a
single subset of input variables that would together explain the output. There
may be interactions that depend on contextual information, i.e., variables that
turn out to be relevant only in some specific circumstances. In this setting, the
contribution of this paper is to extend the random forest variable importances
framework in order (i) to identify variables whose relevance is
context-dependent and (ii) to characterize as precisely as possible the effect
of contextual information on these variables. The usage and the relevance of
our framework for highlighting context-dependent variables are illustrated on
both artificial and real datasets. Comment: Accepted for presentation at UAI 201
Ensembles of extremely randomized trees and some generic applications
Peer reviewed. In this paper we present a new tree-based ensemble method called “Extra-Trees”. This algorithm averages predictions of trees obtained by partitioning the input space with randomly generated splits, leading to significant improvements of precision, and various algorithmic advantages, in particular reduced computational complexity and scalability. We also discuss two generic applications of this algorithm, namely for time-series classification and for the automatic inference of near-optimal sequential decision policies from experimental data.
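The core idea named in this abstract, averaging trees whose splits are drawn at random, can be illustrated with a small regression sketch. It is a simplified illustration under stated assumptions rather than the authors' implementation: each node draws one uniformly random cut-point for up to `k` randomly chosen features and keeps the candidate with the lowest squared error, and every tree is grown on the full training set. The function names, the nested-dict tree layout and the defaults for `k`, `n_min` and `n_trees` are hypothetical.

```python
import random


def _sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(X, y, k, n_min, rng):
    """Grow one randomized regression tree as nested dicts."""
    leaf = {"value": sum(y) / len(y)}
    if len(y) < n_min or min(y) == max(y):
        return leaf
    ranges = [(f, min(r[f] for r in X), max(r[f] for r in X))
              for f in range(len(X[0]))]
    candidates = [c for c in ranges if c[1] < c[2]]  # non-constant features
    best = None
    for f, lo, hi in rng.sample(candidates, min(k, len(candidates))):
        cut = rng.uniform(lo, hi)  # the cut-point is random, not optimized
        left = [i for i, r in enumerate(X) if r[f] < cut]
        right = [i for i, r in enumerate(X) if r[f] >= cut]
        if not left or not right:
            continue
        score = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
        if best is None or score < best[0]:
            best = (score, f, cut, left, right)
    if best is None:
        return leaf
    _, f, cut, left, right = best
    return {
        "feature": f,
        "cut": cut,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           k, n_min, rng),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            k, n_min, rng),
    }


def predict_tree(node, x):
    """Route x down one tree and return the leaf value."""
    while "value" not in node:
        node = node["left"] if x[node["feature"]] < node["cut"] else node["right"]
    return node["value"]


def fit_extra_trees(X, y, n_trees=10, k=1, n_min=2, seed=0):
    """Grow n_trees randomized trees, each on the whole training set."""
    rng = random.Random(seed)
    return [build_tree(X, y, k, n_min, rng) for _ in range(n_trees)]


def extra_trees_predict(trees, x):
    """Average the predictions of all trees."""
    return sum(predict_tree(t, x) for t in trees) / len(trees)
```

Because no cut-point search is performed, the cost of choosing a split in this sketch does not depend on sorting feature values, which is one plausible reading of the reduced computational complexity the abstract mentions.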